Spec 038 Phase 3 — Class-A rules, correctness fixes, EC3 results, template typo fix #250

Merged
codemonkeychris merged 12 commits into main from eval/spec-038-ec3-2026-05-11 on May 12, 2026

Conversation

codemonkeychris (Collaborator) commented May 12, 2026

Summary

  • Phase 3 (Tier-3 rules): three new Class-A induced rules driven by the cross-agent reproducibility audit (gpt-5.5 + claude-sonnet-4.6 × 525-run corpora each) — GridSizeFactoryParensRule (CS1955, 146 events combined, top freq in both corpora, first cross-tier rule), GridSizePxRenameRule (CS0117, 9 events), TextBlockStyleHintRule (CS1061/CS0117, 5 events across two syntactic shapes). ThemeBackgroundSuffixRule reclassified Class-B → Class-A in the file-header comment.
  • Two critical correctness fixes uncovered by end-to-end smoke testing — both were silently no-op'ing every rule firing in production despite all unit tests passing:
    1. CompilationLoader now resolves ProjectReference outputs from project.assets.json's libraries.<id> entries with type=project. Without this, Reactor itself (a project reference for every sample app) was invisible to RuleSymbolResolver and every rule's DeclaredTargets failed — the whole registry self-disabled on every real invocation.
    2. Suggest-gate carve-out for Tier-3 rules. SuggesterOrchestrator takes tier2Enabled: bool; Tier-3 rules always run when their diagnostic code surfaces, Tier-2 stays gated. EC2 watch-item ("Phase-3 rules are the right lever — not Phase-2.x gate tuning") finally addressed in code.
  • Template-identity typo fix (Micrsoft.UI.Reactor.CSharp → Microsoft.UI.Reactor.CSharp in template.json). The typo was breaking dotnet new reactorapp resolution against accumulating template caches; it hit 20/20 runs across both arms of the EC3-original batch.
  • Template-shape cleanup for agent reasoning: removed <ImplicitUsings>enable</ImplicitUsings> from the scaffolded csproj; explicit using directives (System + Microsoft.UI.Reactor + .Core + .Layout + Xaml + Xaml.Controls + static Factories) baked into App.cs. Every symbol now traces to a visible using at the top of the file. The starter App.cs only uses three of the seven imports — the rest are there for the namespaces the agent reaches for within the first few turns of any real app.
  • Skill streamlining: SKILL.md (top-level + reactor-getting-started plugin copy) gains the anti-probe + mur check pointer paragraphs that the EC3 trace analysis identified as load-bearing. reactor-getting-started Tier-1 trims (509 → 415 lines, −18%) — dropped the single-file dotnet run minimal-app block, the standalone csproj xml, the Mode-detection section duplicated by top-level SKILL.md, the App-entry-point section, and the package-cache directory tree. No load-bearing content removed; all five cuts have breadcrumb pointers to where the displaced content lives.
  • Cross-agent audit + EC3 results + reference doc. Audit at docs/specs/tasks/038-tuning-reports/2026-05-11-cross-agent-audit.md closes Data Checkpoint C's reproducibility bar. Reference doc docs/reference/mur-check-did-you-mean.md expanded to cover Phase 2 + 3, the cross-agent mining methodology, the gate carve-out, and the ProjectReference fix. EC3-original (PASS-with-caveats) + EC3-final (clean PASS) results recorded in docs/specs/tasks/038-mur-check-did-you-mean-implementation.md.
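
The gate carve-out from the second correctness fix can be pictured with a minimal Python sketch. The real code is C# in SuggesterOrchestrator; everything below except the idea of a tier2Enabled flag is hypothetical, and the one-entry rule list is only a stand-in for the actual registry. It shows Tier-3 rules running whenever their diagnostic code surfaces, while the Tier-2 fuzzy suggester runs only when the gate is open.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical stand-in for a Tier-3 rule: the diagnostic code it reacts to
# plus a matcher that returns a suggestion or None.
@dataclass(frozen=True)
class Rule:
    name: str
    code: str
    try_match: Callable[[str], Optional[str]]

def suggest(code: str, text: str, rules: list[Rule],
            tier2_enabled: bool,
            tier2: Callable[[str], Optional[str]]) -> list[str]:
    """Tier-3 rules always run for their code; Tier-2 only when the gate is open."""
    out = []
    for rule in rules:
        if rule.code == code:
            hit = rule.try_match(text)
            if hit is not None:
                out.append(f"{rule.name}: {hit}")
    if tier2_enabled:
        guess = tier2(text)
        if guess is not None:
            out.append(f"tier2: {guess}")
    return out

# Toy rule and toy fuzzy matcher, for illustration only.
parens = Rule("GridSizeFactoryParensRule", "CS1955",
              lambda t: "GridSize.Auto" if "GridSize.Auto()" in t else None)
fuzzy = lambda t: "did you mean ...?"

print(suggest("CS1955", "GridSize.Auto()", [parens], False, fuzzy))
print(suggest("CS1955", "GridSize.Auto()", [parens], True, fuzzy))
```

With the gate closed the first call still returns the rule's suggestion, which is the behavior the Wordpuzzle smoke test in the test plan checks end to end.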

Phase 3 V1 ship verdict

EC3-final clean PASS landed 2026-05-12 — supersedes the EC3-original PASS-with-caveats verdict captured under that batch's contaminated-substrate run. Clean batch on eval/spec-038-ec3-2026-05-11 HEAD against the existing n=5 baseline:

| Metric | calc-variant (n=5) | kanban-variant (n=5) |
| --- | --- | --- |
| Tokens mean (Δ vs base) | 195,477 (−33.7%) | 387,236 (−21.2%) |
| Tokens median (Δ vs base) | 180,040 (−37.1%) | 400,466 (+31.7%, see below) |
| Tokens CV | 28.4% | 19.5% (vs base 74%) |
| Cost mean USD (Δ vs base) | $1.92 (−25.6%) | $3.12 (−25.7%) |
| Turns mean (Δ vs base) | 6.4 (−2.2) | 10.4 (0) |
| First-build OK | 5/5 | 5/5 |
| failedToolCalls | 0 | 0 |
| Template / cache failures | 0 / 0 (one auto-recovered retry) | 0 / 0 |

The kanban-median +31.7% delta is a distribution-tightening story, not a regression story. Base kanban distribution was 263K–1,118K tokens (CV 74%), bimodal — most runs sat near the floor, one r1 blowout dragged the mean while the median stayed artificially low at 304K. Variant kanban is 261K–464K (CV 19.5%), no fat tail, every run within 1.8× of best. The load-bearing finding is the 4× CV improvement, which is the predictability-as-a-feature signal the spec §11 risk row called out as deployable-workflow value (separate from any token-mean win). Second batch in a row (after EC1-RR) where this mechanism shows up; first batch where calc also tightens.
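
The three figures in that paragraph can be re-derived from the numbers quoted in this description. The short Python check below uses only values stated above (the base median is the rounded 304K figure, so the result is approximate); the variable names are just illustrative.

```python
# Figures quoted in the PR description (token counts in thousands).
base_median_k = 304.0        # base kanban median
variant_median_k = 400.466   # variant kanban median
variant_min_k, variant_max_k = 261.0, 464.0
base_cv, variant_cv = 74.0, 19.5   # coefficient of variation, percent

median_delta_pct = (variant_median_k - base_median_k) / base_median_k * 100
spread_ratio = variant_max_k / variant_min_k
cv_ratio = base_cv / variant_cv

print(f"median delta: {median_delta_pct:+.1f}%")   # +31.7%
print(f"worst/best:   {spread_ratio:.2f}x")        # 1.78x
print(f"CV ratio:     {cv_ratio:.1f}x")            # 3.8x
```

The median rises because the base median sat near the floor of a bimodal distribution, while the CV ratio of about 3.8 is what the text rounds to 4×.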

All four pass criteria cleared:

| # | Criterion | Result |
| --- | --- | --- |
| 1 | Tokens improve ≥ 5% on at least one arm | Pass (both arms) |
| 2 | First-build OK ≥ 5/5 on both variants | Pass (5/5, 5/5) |
| 3 | No false-positive rule fires | Pass with low confidence: failedToolCalls 0/0; §11 guardrail retrofit (post-run mur check --final audit) still deferred for high-confidence assertion |
| 4 | CV ≤ EC1-RR | Pass (kanban 19.5% vs EC1-RR 54%) |

Full results table at docs/specs/tasks/038-mur-check-did-you-mean-implementation.md § "EC3-final results — 5×N landed 2026-05-12".

One footnote worth recording

EC3-original measured 0/10 firings on the three new Class-A rules (GridSizeFactoryParensRule / GridSizePxRenameRule / TextBlockStyleHintRule). EC3-final doesn't break out per-rule counts, so we can't say whether the clean-PASS win includes any contribution from those three rules or whether it's entirely the structural fixes + template + skill changes carrying the result. The clean PASS supersedes the EC3-original verdict regardless: the rules are correct in isolation, pass Validation Gate bars #1-#4 + #6, and don't actively harm when silent. But "Phase 3 V1 shipped on Class-A rules that may not have fired in production-ish eval" is a footnote worth recording for whoever picks up this work next. The targeted-prompt batch at C:\temp\mur-targeted-prompt-spec.md is the load-bearing follow-up for empirical token-impact numbers on the three Class-A rules specifically.

Test plan

  • dotnet test tests/Reactor.Tests/Reactor.Tests.csproj -c Debug -p:Platform=x64 — 7179 passing / 46 expected skips
  • dotnet test tests/Reactor.IntegrationTests/Reactor.IntegrationTests.csproj with CreateTemplateTests filter — 2/2 passing on the corrected template identity
  • mur check --list-rules shows all six rules enabled with zero self-disables against samples/apps/wordpuzzle
  • Wordpuzzle end-to-end smoke at default --suggest-threshold 3: inject GridSize.Pixel(80) + GridSize.Auto() → both rules fire with full evidence suffixes (gate carve-out verified live)
  • mur pack-local against branch HEAD — Microsoft.UI.Reactor.0.0.0-local.nupkg carries the corrected template identity, explicit-usings App.cs, no implicit usings, and the trimmed agentkit/plugins/reactor/skills/reactor-getting-started/SKILL.md
  • Workstation ~/.templateengine cache drained of stale Micrsoft.UI.Reactor.CSharp entries and reinstalled clean
  • EC3-final clean PASS landed 2026-05-12 — superseded the EC3-original PASS-with-caveats verdict
  • Targeted-prompt batch for empirical Class-A rule validation — follow-up, doesn't block this PR's ship

Surface area

  • New rule files: src/Reactor.Cli/Check/Rules/{GridSizeFactoryParens,GridSizePxRename,TextBlockStyleHint}Rule.cs
  • Modified: src/Reactor.Cli/Check/{CompilationLoader,SuggesterOrchestrator,CheckCommand}.cs, src/Reactor.Cli/Check/Rules/ThemeBackgroundSuffixRule.cs
  • New tests: rule fixture pairs for the three new rules + RulePerformanceTests.cs (§3.1a perf bound) + TemplateMetadataTests.cs (typo regression)
  • Modified tests: CompilationLoaderTests.cs, SuggesterOrchestratorRuleTests.cs
  • Template: tools/Templates/templates/WinUIApp-CSharp/.template.config/template.json (typo fix), Company.ReactorApp1.csproj (drop ImplicitUsings), App.cs (explicit usings)
  • Skills: top-level SKILL.md, plugins/reactor/skills/reactor-getting-started/SKILL.md (Tier-1 trims, anti-probe note, mur check pointer, canonical-usings sync)
  • Docs: docs/reference/mur-check-did-you-mean.md (expanded through Phase 2 + 3 + cross-agent methodology); docs/specs/tasks/038-mur-check-did-you-mean-implementation.md (status snapshot, EC3-original + EC3-final results, cross-agent audit verdicts); new docs/specs/tasks/038-tuning-reports/2026-05-11-cross-agent-audit.md
  • CHANGELOG entries under ## [Unreleased]

🤖 Generated with Claude Code

codemonkeychris and others added 11 commits May 11, 2026 18:41
Three new Class-A induced rules, motivated by the cross-agent audit at
docs/specs/tasks/038-tuning-reports/2026-05-11-cross-agent-audit.md:

- GridSizeFactoryParensRule (CS1955; 146 events combined gpt-5.5 +
  sonnet-4.6; first cross-tier rule since CS1955 is outside Tier-2
  SupportedCodes): GridSize.Auto() -> GridSize.Auto (drop the parens).
- GridSizePxRenameRule (CS0117; 9 cross-agent events): GridSize.Pixel /
  Pixels / Fixed -> GridSize.Px (WPF/WinUI legacy name -> Reactor's Px).
- TextBlockStyleHintRule (CS1061/CS0117; 5 cross-agent events across
  both .Style(...) and `with { Style = ... }` shapes): hint toward
  Reactor's fluent text helpers since the element exposes no Style.
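
To make the two GridSize rewrites concrete, here is a toy Python sketch that applies them as text substitutions keyed on the diagnostic code. This is only an illustration of the suggested before/after shapes: the actual rules are C# and bind to symbols rather than matching text, and the function name and regexes below are assumptions, not the shipped implementation.

```python
import re
from typing import Optional

# Legacy member names listed in the commit message, all mapped to Px.
LEGACY_PX_NAMES = ("Pixels", "Pixel", "Fixed")

def suggest_fix(code: str, source: str) -> Optional[str]:
    """Return a rewritten snippet for the two GridSize shapes, else None."""
    if code == "CS1955":
        # GridSize.Auto() -> GridSize.Auto (a property invoked like a method).
        fixed = re.sub(r"\bGridSize\.(\w+)\(\)", r"GridSize.\1", source)
        return fixed if fixed != source else None
    if code == "CS0117":
        # GridSize.Pixel / Pixels / Fixed -> GridSize.Px
        pattern = r"\bGridSize\.(?:%s)\b" % "|".join(LEGACY_PX_NAMES)
        fixed = re.sub(pattern, "GridSize.Px", source)
        return fixed if fixed != source else None
    return None

print(suggest_fix("CS1955", "GridSize.Auto()"))    # GridSize.Auto
print(suggest_fix("CS0117", "GridSize.Pixel(80)")) # GridSize.Px(80)
```

Keying on the diagnostic code first is what keeps a rule silent when the same text appears under an unrelated error.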

ThemeBackgroundSuffixRule reclassified Class-B -> Class-A (paperwork
only; cross-agent audit shows 27 events on the same key).

Two critical correctness fixes uncovered by end-to-end smoke testing —
both blocked any real-world rule firing before this commit:

1. CompilationLoader.ResolveReferences now walks libraries.<id> entries
   with type=project in project.assets.json and locates the most-recently
   -built matching .dll under that project's bin/ tree. Without this
   every rule's DeclaredTargets failed to resolve and the whole registry
   self-disabled on real mur check invocations (unit tests passed because
   they use synthetic in-memory compilations). Regression locked by
   CompilationLoaderTests.Resolves_ProjectReference_built_dll_from_project_assets_json.

2. SuggesterOrchestrator gains a tier2Enabled bool; CheckCommand.Run
   always builds the orchestrator (when the compilation loads) and passes
   the suggest-gate result in as tier2Enabled. Tier-3 rules always run
   when their diagnostic code surfaces; Tier-2 stays gated on small
   builds where its fuzzy match has near-0% precision (525-run
   calibration). This is the EC2 watch-item ("Phase-3 rules are the
   right lever — not Phase-2.x gate tuning") finally addressed in code.
   Two new orchestrator tests lock down both halves of the carve-out.
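
The ProjectReference walk in item 1 can be sketched in Python as follows. This is a rough model under stated assumptions, not the CompilationLoader code: it assumes project.assets.json sits in the consuming project's obj/ folder, that each type=project entry carries a relative path to the referenced .csproj, and that the built assembly is named after the library id. The function name is hypothetical.

```python
import json
from pathlib import Path
from typing import Optional

def newest_project_dll(assets_path: Path) -> dict[str, Optional[Path]]:
    """Map each type=project library in project.assets.json to its newest built .dll."""
    assets = json.loads(assets_path.read_text(encoding="utf-8"))
    # Assumption: relative paths resolve against the consuming project's
    # directory, taken here as the parent of obj/.
    project_dir = assets_path.parent.parent
    found: dict[str, Optional[Path]] = {}
    for lib_id, lib in assets.get("libraries", {}).items():
        if lib.get("type") != "project":
            continue  # package references are resolved elsewhere
        name = lib_id.split("/", 1)[0]
        # Normalize separators so the sketch behaves the same on any OS.
        rel = lib.get("path", "").replace("\\", "/")
        csproj = (project_dir / rel).resolve()
        # Assumption: the assembly is named after the library id.
        candidates = list((csproj.parent / "bin").glob(f"**/{name}.dll"))
        found[name] = max(candidates, key=lambda p: p.stat().st_mtime,
                          default=None)
    return found
```

A None value marks a referenced project that has not been built yet, which is the case a caller would want to report rather than silently skip.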

§3.1a per-rule performance bound test landed (was deferred until first
rule shipped): RulePerformanceTests.BestMatch_median_under_per_rule_budget
asserts that the median of symbol-resolution + TryMatch stays <= 0.5 ms
per rule per diagnostic, multiplied by a 4x slack factor for CI.
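
The shape of such a bound can be sketched in Python. The 0.5 ms budget and the factor of 4 come from the paragraph above; the helper names and the run count are assumptions made for this illustration, and the real test is C#.

```python
import statistics
import time
from typing import Callable

PER_RULE_BUDGET_MS = 0.5   # budget quoted in the commit message
CI_SLACK = 4               # multiplier quoted in the commit message

def median_ms(fn: Callable[[], object], runs: int = 51) -> float:
    """Median wall-clock time of fn over `runs` calls, in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

def within_budget(observed_ms: float,
                  budget_ms: float = PER_RULE_BUDGET_MS,
                  slack: float = CI_SLACK) -> bool:
    """True when the observed median fits the budget after CI slack."""
    return observed_ms <= budget_ms * slack
```

Asserting on the median rather than the mean keeps a single slow sample (a GC pause, a cold cache) from failing the test on a shared CI machine.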

Status snapshot in the implementation tasks doc updated to record the
sonnet-4.6 corpus aggregation (368 fixes / 564 ranker rows / 41 clusters),
the cross-agent audit verdicts (3 STRONG Class-A targets, plus
TemplatedListView family that's STRONG-after-generalization-over-<T>,
plus the gpt-5.5-only CS1955/GridElement family deferred to a third
corpus drop), and the rule-PR queue with this commit's three Class-A
rules marked authored.

Branch is for spec 038 EC3 eval — see C:\temp\mur-ec3-handoff.md.

Full Reactor.Tests suite: 7175 passing / 46 expected skips.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The reference doc was scoped to Phase 0+1 plus the suggest-gate. This
update extends it to cover everything shipping since:

- Phase 2 (merged): MSBuild passthrough via `--`, mode flags
  (--strict/--final/--quiet/--emit-threshold), the deterministic
  pre-emit policy table, and the suppress-to-error guardrail tool.
- Phase 3 (in flight on this branch): the IRulePattern infrastructure,
  RuleSymbolResolver / RuleRegistry, --disable-rule + --list-rules CLI
  surface, six authored rules (three Class-A induced + three Class-B
  vocabulary), the symbol-binding contract from §3.1a, and the
  per-rule perf bound test.
- Two critical correctness fixes uncovered during Phase 3 end-to-end
  smoke testing: CompilationLoader's ProjectReference resolution path
  and the suggest-gate carve-out for Tier-3 rules. Both get their own
  subsections in §3 explaining why unit tests passed while production
  silently no-op'd, since that failure mode generalises beyond this
  spec.
- The cross-agent mining drop (`claude-sonnet-4.6` × 525 runs) and the
  audit it produced. New subsection in §4 on comparing models to
  separate structural vocabulary-confusion signals from agent-specific
  idiosyncrasies; new subsection in §5 on what the second-agent corpus
  changed (B->A promotions, single-corpus deferrals, cross-syntactic-
  shape rule emergence).

§9 (Future improvements) tightened to what's actually left: remainder
of Phase 3 (more rules pending a third-agent corpus + Class-B catalog
expansion), Phase 4 (telemetry + learned ranker, blocked on Data
Checkpoint D), and a "what EC3 will tell us" subsection that frames
EC3 as a fresh measurement rather than an incremental delta on EC2.

Glossary gains: rule carve-out, pre-emit ranker, symbol-binding
contract, ProjectReference resolution, cross-agent audit, provenance.
TOC updated for the §8 rename ("in this PR" -> "so far").

Tone matches the existing doc: plain language first, then engineer
detail, then ML-practitioner detail. The same explanatory pattern
spec 038's design doc uses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PASS-with-caveats. Calc cleared the >=5% token-improvement bar
(-5.2% mean, -13.0% median); kanban regressed (+14.9% mean, +60.7%
median). The three Class-A rules added in this branch fired zero
times across all 10 variant runs, so the EC3 batch never exercised them.
The calc improvement is plausibly driven by the CompilationLoader
+ gate carve-out fixes letting rules run at all, not by the new
rules.

Tool-call profile diff identifies the +3.2 turn delta on kanban:
variant agent does ~+1 skill load, +1 view, +1 apply_patch per run
vs base, consistent with a "verify-before-edit" loop triggered by
rule suggestions. Mechanism cited in handoff section 7.

Recommend: do not declare Phase 3 cleared on this batch alone.
Re-run with prompts that target GridSize/TextBlock patterns to
get Class-A rule exercise; investigate the kanban-base R1 outlier
(1.12M tokens, 3.4x median) before reading the kanban regression
as decisive.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…edit framing

Per-call inspection of variant kanban rg patterns and view paths
doesn't support the earlier "verify-before-edit" hypothesis: ~11/12 rg
calls probe drag/drop and modifier APIs unrelated to mur output; view
calls are mostly the agent re-reading its own in-progress workspace
files. The two rule-fired runs (r1=Theme, r4=Align) are middle-of-the-
pack on turns and tools, not the heaviest. The variant mean is dragged
up by r5 (20 turns, 27 tool calls, 889K tokens, zero rule fires) which
looks like a generic long-tail trajectory comparable to base R1.

Reframing: rule fires correlate with normal token usage when they
happen; mur check can't help on builds where the agent's mistakes
fall outside the rule set's coverage. The kanban-prompt -> rule-
coverage gap is the underlying issue, not rule-induced verification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single-character bug in tools/Templates/templates/WinUIApp-CSharp/
.template.config/template.json lines 5-6: the template's identity and
groupIdentity were "Micrsoft.UI.Reactor.CSharp" / "Micrsoft.UI.Reactor"
(missing the second 'o'). Checked into the repo since at least Phase 1.

How it surfaced. The eval harness runs `dotnet new install ... --force`
on every setup. Multiple installs accumulate duplicate entries in the
user's ~/.templateengine/dotnetcli/<sdk>/templatecache.json under the
misspelled identity. The duplicate-match condition makes
`dotnet new reactorapp` resolve more than one template for the
"reactorapp" short name, throwing "Sequence contains more than one
matching element" with exit code 70. The EC3 5x2 batch hit this 20/20
runs across both arms — the spec doc's earlier framing ("agent typo,
at least one variant kanban run, didn't block the build") was wrong on
three counts; corrected in this commit.
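
The failure mode is the familiar "expected exactly one match" lookup. The Python sketch below models only that part: the cache entry shape and the function are hypothetical, and it does not reproduce the real templatecache.json schema or the template engine's resolution logic.

```python
class AmbiguousTemplateError(Exception):
    """Raised when a short name maps to more than one cached template."""

def resolve_short_name(cache: list[dict], short_name: str) -> dict:
    """Return the single cached template for short_name, or raise."""
    matches = [t for t in cache if t["shortName"] == short_name]
    if not matches:
        raise LookupError(f"no template for '{short_name}'")
    if len(matches) > 1:
        raise AmbiguousTemplateError(
            f"{len(matches)} templates match '{short_name}'")
    return matches[0]

# Hypothetical cache entries: one install versus accumulated duplicates.
entry = {"identity": "Micrsoft.UI.Reactor.CSharp", "shortName": "reactorapp"}
clean_cache = [dict(entry)]
stale_cache = [dict(entry), dict(entry)]

print(resolve_short_name(clean_cache, "reactorapp")["identity"])
try:
    resolve_short_name(stale_cache, "reactorapp")
except AmbiguousTemplateError as err:
    print("failed:", err)
```

This also illustrates why an isolated hive hides the problem: with a single entry the lookup succeeds no matter how the identity is spelled.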

Why the existing integration test (CreateTemplateTests) didn't catch
the typo: it installs the template into a per-test ephemeral
--debug:custom-hive, where the misspelled identity is the only entry
and `dotnet new` resolves correctly. The bug only surfaces against the
user's real (accumulating) cache. The new test (described below) is
content validation, not install/run behavior — orthogonal coverage
that catches the typo regardless of cache state.

Test added: tests/Reactor.Tests/TemplateMetadataTests.cs. Four xUnit
[Fact]s that load template.json directly:
  - Identity_is_canonical_brand_namespace: exact-match assertion
    against "Microsoft.UI.Reactor.CSharp".
  - GroupIdentity_is_canonical_brand_namespace: exact-match against
    "Microsoft.UI.Reactor".
  - File_contains_no_brand_typos: substring sweep for "Micrsoft"
    anywhere in the file (belt-and-suspenders catch for future typos
    in any new symbol/description/etc.).
  - ShortName_resolves_to_reactorapp: anchors the public CLI command
    name documented in SKILL.md and the wordpuzzle smoke pattern.
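
As a rough illustration of what content validation means here, the four checks can be written as one Python function over the raw template.json text. The real tests are xUnit [Fact]s in C#; the function name and the problem strings are assumptions for this sketch, while the expected values come from the list above.

```python
import json

EXPECTED_IDENTITY = "Microsoft.UI.Reactor.CSharp"
EXPECTED_GROUP = "Microsoft.UI.Reactor"
EXPECTED_SHORT_NAME = "reactorapp"

def template_metadata_problems(raw: str) -> list[str]:
    """Return human-readable problems found in a template.json payload."""
    doc = json.loads(raw)
    problems = []
    if doc.get("identity") != EXPECTED_IDENTITY:
        problems.append(f"identity is {doc.get('identity')!r}")
    if doc.get("groupIdentity") != EXPECTED_GROUP:
        problems.append(f"groupIdentity is {doc.get('groupIdentity')!r}")
    # shortName may be a single string or a list of strings.
    short = doc.get("shortName")
    names = short if isinstance(short, list) else [short]
    if EXPECTED_SHORT_NAME not in names:
        problems.append(f"shortName is {short!r}")
    # Substring sweep over the whole file, not just the parsed fields.
    if "Micrsoft" in raw:
        problems.append("file contains the 'Micrsoft' typo")
    return problems
```

Because the check reads the file directly, it gives the same answer whatever state the machine's template cache is in.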

Workstation cache drained + reinstalled: `dotnet new uninstall
Microsoft.UI.Reactor.ProjectTemplates` repeated until empty, then
`mur pack-local` repacked against the fixed template, then
`dotnet new install` reinstalled. ~/.templateengine cache now carries
exactly one canonical "Microsoft.UI.Reactor.CSharp" entry across both
SDK versions on disk (10.0.104, 10.0.203).

Existing tests unaffected: Reactor.Tests 7179 passing / 46 expected
skips (up from 7175, +4 from the new template-metadata tests).
CreateTemplateTests integration smoke (`dotnet new reactorapp` + build
+ run + UI Automation find) passes 2/2 with the corrected identity.

EC3 verdict implication: both arms hit the typo equally, so the
relative deltas (calc -5.2%, kanban +14.9%) are not biased *by this
bug*. Absolute costs are inflated on every run; the long-tail outliers
(variant r5 = 889K tokens, base r1 = 1.12M tokens) likely had their
trajectories pushed further by `dotnet-new` thrash. The PASS-with-
caveats verdict still stands directionally; a re-run with the typo
fixed could materially shift the numbers in either direction. Spec doc
updated to reflect this.

Two harness-side mitigations deferred to separate follow-ups (the
source typo is the load-bearing fix; without it the harness mitigations
would still leak):
  1. `dotnet new uninstall Microsoft.UI.Reactor.ProjectTemplates`
     before `dotnet new install --force` in eval setup, so future
     typo-equivalent bugs can't accumulate.
  2. Propagate inner-command exit codes into the PowerShell tool
     wrapper's `success` field so `failedToolCalls` stops lying about
     dotnet-new failures.
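
The second mitigation amounts to deriving `success` from the inner command's exit code instead of from the wrapper having run at all. A minimal Python sketch of that idea follows; the harness wrapper itself is PowerShell, so the function and the result fields other than `success` are assumptions.

```python
import subprocess
import sys

def run_tool(argv: list[str]) -> dict:
    """Run a command and report success from its real exit code."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return {
        "exitCode": proc.returncode,
        "success": proc.returncode == 0,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }

# Simulate a failing inner command with the exit code reported above (70).
failing = run_tool([sys.executable, "-c", "import sys; sys.exit(70)"])
print(failing["success"], failing["exitCode"])  # False 70
```

With this in place a count like failedToolCalls would include the dotnet-new failures that the current wrapper reports as successes.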

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EC3-post-typo-fix smoke trace analysis identified two reads following
the scaffold: `view App.cs` (essential — the file the agent is about
to apply_patch) and `view <project>.csproj` (defensive — the agent
checking that the scaffold produced a sane csproj). The .csproj read
is informational at best: the scaffold's stdout already showed the
file listing plus "Restore succeeded.", and a calc/kanban-shaped task
never modifies the .csproj.

Across the prior 10 variant runs, calc averaged 2.2 views/run and
kanban averaged 2.2 (r5's 4 reads pulling the kanban mean up). Two
views post-scaffold is the modal pattern, so a one-line skill note
that heads off the defensive read should cut that count noticeably.

Added the same one-line note in two places so both skill consumers see
it:
- plugins/reactor/skills/reactor-getting-started/SKILL.md right after
  the canonical .csproj block (line ~102, next to the
  WindowsPackageType / UseWinUI MUST rules).
- SKILL.md (top-level, packed into the nupkg) right after the matching
  csproj block in the Project Setup section.

The wording explicitly carves out App.cs as still-necessary so the
note doesn't suppress useful reads. Estimated savings: one view + a
few hundred tokens per scaffold step. Small per-run, real across the
batch since every eval scaffolds.

Repacked Microsoft.UI.Reactor.0.0.0-local.nupkg so the bundled
agentkit/SKILL.md carries the update.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…anti-probe + mur check pointer

Three related changes addressing the post-scaffold agent-confusion pattern
the EC3-post-typo-fix trace analysis identified:

1. tools/Templates/templates/WinUIApp-CSharp/Company.ReactorApp1.csproj
   removes <ImplicitUsings>enable</ImplicitUsings> and the three <Using
   Include="..."> items. Reactor's namespaces are now explicitly
   `using`-imported at the top of App.cs:
       using Microsoft.UI.Reactor;
       using Microsoft.UI.Reactor.Core;
       using static Microsoft.UI.Reactor.Factories;
   Why: with implicit usings on, the source file looks like it's missing
   namespace context — `VStack`, `Heading`, `Component` appear unqualified
   without a visible `using`, which confuses agents reasoning about where
   symbols come from. The agent has to read the csproj to find the global
   Using items, then mentally merge them into App.cs's namespace scope.
   Explicit usings make App.cs self-contained: every symbol's source is
   one of the three using directives at the top of the file. The skill
   text now says "App.cs has its own using directives at the top, which is
   the only place you add new namespaces" — which is true after this
   change.

2. SKILL.md + plugins/reactor/skills/reactor-getting-started/SKILL.md
   expand the existing "trust the scaffolded .csproj" note into an
   anti-probe paragraph that enumerates the exact post-scaffold file
   list: "the workspace contains exactly two source files: App.cs (entry
   point + initial component) and <Name>.csproj. There is no Program.cs
   and no GlobalUsings.cs — modify App.cs in place."
   Why: the eval orchestrator's trace analysis identified a recurring
   "agent probes for files that don't exist" pattern (sometimes asking
   for Program.cs, sometimes inspecting obj/GlobalUsings.g.cs). Pinning
   the file list in the skill is a one-paragraph fix.

3. Same two SKILL.md files add a 1-paragraph mur check pointer alongside
   the anti-probe note: "Verify your edits with mur check before
   declaring done... For anything more involved than the build/fix loop —
   strict-mode failures, custom diagnostic gating, MSBuild passthrough
   flags — load the reactor-build-and-check skill."
   Why: the deeper reactor-build-and-check skill is a heavy load (full
   --strict / --final / --quiet / --emit-threshold / --suppress-error
   surface plus the iter/final framing). Most agent runs just need the
   basic loop. Promoting mur check into getting-started with a one-liner
   for the basic case lets the agent stay in the lighter skill until
   they actually hit advanced behavior.

Verified: dotnet new reactorapp -n X builds clean in both the top-level-
program default and the --use-program-main true variant. Existing
CreateTemplateTests integration smoke (2/2) and TemplateMetadataTests
unit tests (4/4) pass. mur pack-local refreshes both nupkgs against the
new template + skill content.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The prior commit (94d563f) added three usings to App.cs (Microsoft.UI.
Reactor, .Core, static Factories) so the source is self-contained after
dropping <ImplicitUsings>. The skill's "Required imports" section
documents the full *canonical* set as five-plus-one — adding
Microsoft.UI.Reactor.Layout, Microsoft.UI.Xaml, and
Microsoft.UI.Xaml.Controls to the minimum three. The template and the
skill had therefore diverged: the agent reading App.cs would see three
usings while the skill text says the canonical set is six.

Sync the template to the skill: App.cs now ships all five-plus-one
using directives, with the same `// FlexDirection, FlexJustify, ...`
inline comments the skill uses for each non-obvious namespace. The
starter App.cs still only uses three of them (Reactor, Core, static
Factories); the other three are there because the agent will reach
for them within the first ~5 turns of any real app (alignment enums,
InfoBarSeverity, FlexDirection).

Updated the SKILL.md anti-probe paragraph in both copies to point at
`using System.Linq;` as the example of "when you add a new namespace,
add it to App.cs's using block" — System.Linq is a real common add and
isn't in the canonical six, so the example stays accurate. The top-
level SKILL.md also explicitly names the canonical set so readers can
cross-reference without flipping to the imports section.

Verified: dotnet new reactorapp -n X builds clean in the default
variant. CreateTemplateTests integration smoke 2/2 and
TemplateMetadataTests 4/4 pass against the expanded usings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
With ImplicitUsings disabled (template commit 94d563f) the agent doesn't
get System auto-imported. Common BCL surface — Action, Func, EventArgs,
DateTime, Math, TimeSpan, Random — all live there, and they show up
within the first few turns of any non-trivial app (event handlers,
timers, randomization, formatting). Adding `using System;` to the
template's App.cs eliminates the "Action does not exist in the current
context" miss that's otherwise the first thing the agent hits when
they author an event handler.

Synced the canonical set in three places so they stay coherent:
- tools/Templates/templates/WinUIApp-CSharp/App.cs (scaffold output)
- plugins/reactor/skills/reactor-getting-started/SKILL.md "Required
  imports" code block
- SKILL.md anti-probe note's parenthetical canonical-set list

Verified: scaffolded App.cs ships `using System;` at the top of the
canonical seven-line using block; default-variant build clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five surgical removals identified by the analysis pass:

A. "Minimal app — single file" block (36 lines → 5). The single-file
   `dotnet run App.cs` flow is a side path now that `dotnet new
   reactorapp` is the primary entry; the canonical-shape teaching
   moved into the scaffolded App.cs. Kept a 1-paragraph pointer to
   reactor-build-and-check's single-file-scripts section for the
   demo case.

B. Standalone `.csproj` xml block (17 lines dropped). The xml taught
   the agent how to write a csproj from scratch — but the agent
   doesn't author one. `dotnet new reactorapp` produces it. Kept the
   "when to use a .csproj" framing + the WindowsPackageType /
   UseWinUI MUST-rules + the recently-added anti-probe + mur check
   paragraphs.

C. "Mode detection — selfhost vs. NuGet consumer" section (29 lines
   → 2). The top-level SKILL.md already owns selfhost/consumer
   bootstrap; re-explaining it here was a second copy. The new
   one-paragraph "Bootstrap" section breadcrumb-points readers to
   SKILL.md and keeps the load-bearing `mur pack-local` recovery
   tip inline.

D. "App entry point" section (13 lines → 0). The ReactorApp.Run<App>
   form is already in the scaffolded App.cs. The unique content was
   the inline-render-function form `ReactorApp.Run("T", ctx => ...)`
   — embedded that as a one-line addendum to §Components instead of
   carrying a whole section for it.

E. "Where the skill content comes from" package-cache directory tree
   (6 lines dropped). The literal `%USERPROFILE%\.nuget\...` block
   was reference material an agent can `find` on demand. Kept the
   plugin-channel framing + the api-index pointer + the "read once,
   cache in working memory" tip.

What's preserved unchanged:
- The React→Reactor table (highest-value block in the file)
- Components / Hooks / Common factories / Theme tokens / Critical
  gotchas (load-bearing reference content)
- The recent anti-probe + mur check paragraphs
- The trimmed sections still carry their breadcrumb pointers so
  agents looking for the removed content find their way to the
  right skill (reactor-build-and-check, top-level SKILL.md, etc.)

Tier 2 (move drag-and-drop to reactor-input, trim Context, drop
duplicate List/UseReducer callout) and Tier 3 (move ContentDialog +
Flyout to reactor-recipes) are follow-up considerations, not applied
in this commit — they want eval validation before landing.

Repacked Microsoft.UI.Reactor.0.0.0-local.nupkg so the bundled
agentkit/plugins/reactor/skills/reactor-getting-started/SKILL.md
carries the trimmed file (verified: nupkg copy is 415 lines, matches).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Append the EC3-final results subsection to the implementation tasks
doc; mark EC3-original as superseded (preserved as the historical
record of the typo-contaminated batch and the PASS-with-caveats
reasoning that drove the watch-item triage work).

EC3-final headline numbers (5×N landed 2026-05-12 on
eval/spec-038-ec3-2026-05-11 @ 053afe9):
  calc:   tokens −33.7% mean / −37.1% median, cost −25.6%,
          turns −2.2, CV 28.4%, first-build 5/5
  kanban: tokens −21.2% mean (median +31.7% is the
          distribution-tightening artifact, not a regression —
          base CV 74% bimodal vs variant CV 19.5% no-fat-tail),
          cost −25.7%, turns 0, first-build 5/5

The 4× kanban CV improvement is the load-bearing finding — second
batch in a row (after EC1-RR) where the predictability-as-a-feature
signal shows up, first batch where calc also tightens.

All four EC3 pass criteria cleared. Spec §12's "~−$0.70 per run"
prediction is roughly met on calc ($0.66 saved) and exceeded on kanban
($1.08 saved). Spec EC3 row's "~−2 turns" prediction is close on calc
(−2.2).
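
The per-run dollar figures can be back-computed from the variant cost means and relative deltas in the results table. The Python sketch below does that arithmetic; the function name is illustrative and the inputs are the rounded values quoted in this PR.

```python
PREDICTED_SAVING = 0.70   # "~-$0.70 per run" from spec section 12

def saving_per_run(variant_mean: float, delta_pct: float) -> float:
    """Back out the base mean from a relative delta and return base - variant."""
    base_mean = variant_mean / (1 + delta_pct / 100.0)
    return base_mean - variant_mean

calc = saving_per_run(1.92, -25.6)     # calc cost mean USD and delta
kanban = saving_per_run(3.12, -25.7)   # kanban cost mean USD and delta
print(f"calc: ${calc:.2f}  kanban: ${kanban:.2f}")  # calc: $0.66  kanban: $1.08
print(calc >= PREDICTED_SAVING, kanban >= PREDICTED_SAVING)
```

Comparing each value with the predicted 0.70 shows which arm clears the prediction and by how much.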

One unresolved footnote: per-rule firing counts weren't broken out
in this batch. EC3-original was 0/10 on the three new Class-A rules;
this clean PASS may be carried entirely by the structural fixes +
template + skill changes with the three new rules still inert. The
verdict supersedes EC3-original regardless (rules are correct in
isolation, pass bars #1-#4 + #6, don't actively harm when silent),
but the targeted-prompt batch at C:\temp\mur-targeted-prompt-spec.md
remains the load-bearing follow-up for getting empirical token-
impact numbers on those three rules specifically.

Watch-items carried into V1 / Phase 4 review:
- Class-A rule exercise via targeted-prompt batch
- §11 risk-row guardrail retrofit (post-run mur check --final audit)
- Tier-2 SKILL.md trims (now empirically de-risked)
- rule_fired trace event addition

Verdict: PASS, clean. Phase 3 V1 cleared to ship. PR #250 updated
with the same verdict in its body.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

This PR advances mur check’s “did-you-mean” engine (spec 038 Phase 3) by adding three new Class‑A Tier‑3 rules, fixing two production-blocking correctness gaps (ProjectReference reference resolution + Tier‑3 suggest-gate carve-out), and repairing/streamlining the dotnet new reactorapp template and associated skill/docs/test coverage.

Changes:

  • Add three new Tier‑3 Class‑A induced rules (GridSize parens, GridSize Px rename, TextBlock Style hint) plus fixture tests and a perf-bound test.
  • Fix real-world rule execution by resolving ProjectReference outputs from project.assets.json and by gating Tier‑2 only (Tier‑3 rules always run when their codes surface).
  • Fix template identity typo and adjust the template shape (drop implicit/global usings; add explicit using block in App.cs), plus skill/docs/changelog updates.
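The ProjectReference fix rests on reading the `libraries` section of `project.assets.json` and keeping the entries whose `type` is `project`. A minimal sketch of that enumeration step follows. It is an illustration only, not the code in `CompilationLoader`: the class name `AssetsProjectRefs` and the method `ProjectLibraryPaths` are hypothetical, and mapping each project path to its built DLL is deliberately left out because the PR text does not describe how that is done.

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper, not part of Reactor.Cli.
static class AssetsProjectRefs
{
    // Returns the "path" of every libraries.<id> entry whose "type" is "project".
    public static List<string> ProjectLibraryPaths(string assetsJson)
    {
        var paths = new List<string>();
        using var doc = JsonDocument.Parse(assetsJson);

        if (doc.RootElement.ValueKind != JsonValueKind.Object ||
            !doc.RootElement.TryGetProperty("libraries", out var libs) ||
            libs.ValueKind != JsonValueKind.Object)
        {
            return paths; // no libraries section: nothing to resolve
        }

        foreach (var lib in libs.EnumerateObject())
        {
            if (lib.Value.ValueKind != JsonValueKind.Object) continue;

            bool isProject =
                lib.Value.TryGetProperty("type", out var type) &&
                type.ValueKind == JsonValueKind.String &&
                type.GetString() == "project";

            // Package entries (type = "package") are skipped here.
            if (isProject &&
                lib.Value.TryGetProperty("path", out var path) &&
                path.ValueKind == JsonValueKind.String)
            {
                paths.Add(path.GetString()!);
            }
        }
        return paths;
    }
}
```

The returned paths are relative project-file paths; a caller would still need to locate each project's build output before handing it to the compiler as a metadata reference.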
Show a summary per file

| File | Description |
| --- | --- |
| tools/Templates/templates/WinUIApp-CSharp/Company.ReactorApp1.csproj | Removes implicit/global usings from the scaffolded project. |
| tools/Templates/templates/WinUIApp-CSharp/App.cs | Adds explicit using directives for the scaffolded starter app. |
| tools/Templates/templates/WinUIApp-CSharp/.template.config/template.json | Fixes template identity/groupIdentity typo (Micrsoft → Microsoft). |
| tests/Reactor.Tests/TemplateMetadataTests.cs | Adds unit tests guarding template metadata/branding invariants. |
| tests/Reactor.Tests/CheckCommandTests/SuggesterOrchestratorRuleTests.cs | Adds tests asserting Tier‑3 rules still fire when Tier‑2 is suggest-gated off. |
| tests/Reactor.Tests/CheckCommandTests/Rules/TextBlockStyleHintRuleTests.cs | Fixture tests for TextBlockStyleHintRule (positive + negative). |
| tests/Reactor.Tests/CheckCommandTests/Rules/RulePerformanceTests.cs | Adds perf bound test for per-rule BestMatch cost. |
| tests/Reactor.Tests/CheckCommandTests/Rules/GridSizePxRenameRuleTests.cs | Fixture tests for GridSizePxRenameRule (positive + negative). |
| tests/Reactor.Tests/CheckCommandTests/Rules/GridSizeFactoryParensRuleTests.cs | Fixture tests for GridSizeFactoryParensRule (positive + negative). |
| tests/Reactor.Tests/CheckCommandTests/CompilationLoaderTests.cs | Adds regression test for resolving ProjectReference-built DLLs via assets.json. |
| src/Reactor.Cli/Check/SuggesterOrchestrator.cs | Introduces tier2Enabled gating (Tier‑2 only) while always allowing rules. |
| src/Reactor.Cli/Check/Rules/ThemeBackgroundSuffixRule.cs | Updates rule header docs to reflect Class‑A evidence/reclassification. |
| src/Reactor.Cli/Check/Rules/TextBlockStyleHintRule.cs | New Tier‑3 rule for missing TextBlockElement.Style patterns. |
| src/Reactor.Cli/Check/Rules/GridSizePxRenameRule.cs | New Tier‑3 rule mapping legacy Pixel/Pixels/FixedPx. |
| src/Reactor.Cli/Check/Rules/GridSizeFactoryParensRule.cs | New Tier‑3 rule for `GridSize.<property>()` CS1955 parens removal. |
| src/Reactor.Cli/Check/CompilationLoader.cs | Resolves ProjectReference outputs by scanning libraries entries in assets.json. |
| src/Reactor.Cli/Check/CheckCommand.cs | Loads compilation once and wires tier2Enabled through the orchestrator. |
| skills/reactor.api.txt | Updates API index content (new surfaced APIs). |
| SKILL.md | Updates top-level skill guidance (anti-probe + mur check workflow notes). |
| plugins/reactor/skills/reactor-getting-started/SKILL.md | Trims/reshapes getting-started skill and synchronizes scaffold/import guidance. |
| plugins/reactor/skills/reactor-dsl/references/reactor.api.txt | Updates packaged API index copy. |
| docs/specs/tasks/038-tuning-reports/2026-05-11-cross-agent-audit.md | Adds cross-agent reproducibility audit writeup. |
| docs/specs/tasks/038-mur-check-did-you-mean-implementation.md | Updates spec task status/results narrative through EC3 findings. |
| docs/reference/mur-check-did-you-mean.md | Expands reference doc to cover Phase 2–3 behavior and fixes. |
| CHANGELOG.md | Records new rules and correctness fixes under Unreleased. |

Copilot's findings

  • Files reviewed: 25/25 changed files
  • Comments generated: 2

Comment thread tests/Reactor.Tests/CheckCommandTests/Rules/RulePerformanceTests.cs Outdated
Comment thread src/Reactor.Cli/Check/CheckCommand.cs
…on non-suggestable builds

Copilot review surfaced two substantive issues; both fixed.

(1) RulePerformanceTests CombinedStub only carried targets for the
three earlier Class-B rules. With RuleRegistry.Default now including
the three new Class-A rules (GridSizeFactoryParens, GridSizePxRename,
TextBlockStyleHint), those three were silently self-disabling during
the perf test — TargetsResolve failed against the stub's missing
GridSize and TextBlockElement types. The 'budget = 0.5ms × ruleCount'
assertion then scaled by registry.All.Length (six) while only
measuring three rules' actual cost, so the bound was 2× loose.

Fix: extended the stub with Microsoft.UI.Reactor.GridSize (record
struct with Auto/Star/Px matching the real shape) and
Microsoft.UI.Reactor.Core.TextBlockElement (record). Added a
stub-coverage guard at the top of the perf test that asserts every
rule in RuleRegistry.Default.All resolves its declared targets
against the test compilation — fails loudly with the missing target
name and rule name if someone adds a new rule without updating the
stub. Future-proofs the budget assertion.
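To make the looseness concrete: with six registered rules the budget was 0.5 ms × 6 = 3 ms, but only three rules did any work, so each measured rule effectively had 1.0 ms, twice the intended bound. The guard idea can be sketched as below. This is a simplified illustration under stated assumptions, not the test's actual code: `RuleInfo` and `StubCoverageGuard` are hypothetical names, and a plain set of type names stands in for the Roslyn test compilation that the real test resolves targets against.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a rule: its name plus the type names it declares as targets.
record RuleInfo(string Name, string[] DeclaredTargets);

static class StubCoverageGuard
{
    // Lists "rule: target" for every declared target the stub cannot resolve.
    public static List<string> MissingTargets(
        IEnumerable<RuleInfo> rules, ISet<string> stubTypes) =>
        rules.SelectMany(r => r.DeclaredTargets
                 .Where(t => !stubTypes.Contains(t))
                 .Select(t => $"{r.Name}: {t}"))
             .ToList();

    // Fails loudly, naming each rule and missing target, before any timing runs.
    public static void AssertCovered(
        IEnumerable<RuleInfo> rules, ISet<string> stubTypes)
    {
        var missing = MissingTargets(rules, stubTypes);
        if (missing.Count > 0)
        {
            throw new InvalidOperationException(
                "Stub is missing rule targets: " + string.Join(", ", missing));
        }
    }
}
```

Running the guard first means the `ruleCount` multiplier in the budget only ever counts rules that actually execute against the stub.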

(2) CheckCommand.Run unconditionally called
CompilationLoader.Instance.Load(path) after the EC3 gate carve-out
refactor, even when no parsed diagnostic could plausibly produce a
suggestion (no diagnostics at all; only Tier-2 codes with the gate
closed and no rule covering them; only nullable/XML-doc warnings).
The compilation load is 50–500 ms cold — .cs enumeration, file-set
hash, full reference resolution including the new ProjectReference
walk. Paying it on every clean mur check was a wall-time
regression on the happy path.

Fix: added SuggesterOrchestrator.AnyDiagnosticIsSuggestable(diags,
tier2Enabled, rules) — flat scan over the (small) diagnostic list
against the union of Tier-2's SupportedCodes and every rule's
DiagnosticCodes. Microseconds. CheckCommand.Run now gates the
compilation load behind that pre-check: only loads when at least one
diagnostic could plausibly produce a suggestion.
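The pre-check logic can be sketched roughly as follows. The signature is simplified for illustration and is an assumption: the real method takes `(diags, tier2Enabled, rules)`, whereas this sketch takes the two code sets directly, and the class name `SuggestPrecheck` is hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical simplification of the pre-check; parameter shapes are assumed.
static class SuggestPrecheck
{
    // tier2Codes: diagnostic codes the Tier-2 suggester supports.
    // ruleCodes:  union of the diagnostic codes of every registered rule.
    public static bool AnyDiagnosticIsSuggestable(
        IEnumerable<string> diagnosticCodes,
        bool tier2Enabled,
        ISet<string> tier2Codes,
        ISet<string> ruleCodes) =>
        diagnosticCodes.Any(code =>
            // Rules (Tier-3) always count, regardless of the gate.
            ruleCodes.Contains(code) ||
            // Tier-2 only counts when the suggest gate is open.
            (tier2Enabled && tier2Codes.Contains(code)));
}
```

Because the scan touches only diagnostic codes and two small sets, it needs no compilation, which is what lets the caller skip the load entirely when it returns false.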

Test coverage:
  - RulePerformanceTests: stub-coverage guard asserts every
    DeclaredTarget across RuleRegistry.Default.All resolves.
  - SuggesterOrchestratorRuleTests gains 5 new facts:
      * empty diag list → false (clean build skips load)
      * unrelated CS warnings (CS8602/CS8618) → false
      * CS1061 + tier2Enabled=true → true
      * CS1061 + tier2Enabled=false + no rule → false (gate-closed
        Tier-2-only path is non-suggestable)
      * CS1955 covered by rule + tier2Enabled=false → true (Tier-3
        always runs)

Verified:
  - Reactor.Tests 7184 passing / 46 expected skips (was 7179, +5).
  - CreateTemplateTests integration smoke 2/2.
  - Clean wordpuzzle mur check exits with no output (pre-check
    short-circuits — no compilation load).
  - Wordpuzzle with GridSize.Pixel(80) + GridSize.Auto() injected:
    both rules still fire under the default gate with full evidence
    suffixes. Pre-check correctly identifies the build as
    suggestable; nothing regressed.
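For readers unfamiliar with the two injected mistakes, the sketch below shows why they produce CS1955 and CS0117. The `GridSize` type here is a hypothetical stub written for this example: only the member names Auto/Star/Px come from the notes above, while the constructor, the `Unit` field, and the exact signatures are assumptions and may not match the real Microsoft.UI.Reactor.GridSize.

```csharp
// Hypothetical stub; member names follow the perf-test stub, the rest is assumed.
public readonly record struct GridSize(double Value, string Unit)
{
    public static GridSize Auto => new(0, "auto");   // a property, so no parentheses
    public static GridSize Star => new(1, "star");
    public static GridSize Px(double pixels) => new(pixels, "px");
}

static class GridSizeExamples
{
    // GridSize.Auto()    fails with CS1955 (a non-invocable member used like a method);
    //                    the parens rule presumably suggests GridSize.Auto.
    // GridSize.Pixel(80) fails with CS0117 (GridSize has no member named Pixel);
    //                    the rename rule presumably suggests GridSize.Px(80).
    public static GridSize[] Corrected() =>
        new[] { GridSize.Auto, GridSize.Px(80) };
}
```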

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codemonkeychris
Collaborator Author

Both Copilot CR comments addressed in e4e7c7d.

1. RulePerformanceTests stub coverage (line 35). Confirmed real — the stub carried targets for the three earlier Class-B rules but not for the new GridSize / TextBlockElement Class-A rules, so three rules were silently self-disabling and the budget = 0.5ms × registry.All.Length assertion was scaling by 6 while only measuring 3 rules' actual cost (2× loose bound).

Fix: extended CombinedStub with Microsoft.UI.Reactor.GridSize (record struct, Auto/Star/Px) and Microsoft.UI.Reactor.Core.TextBlockElement (record). Added a stub-coverage guard at the top of the perf test that asserts every rule in RuleRegistry.Default.All resolves its declared targets against the test compilation — fails loudly with the missing target name + rule name if someone adds a new rule without updating the stub. Future-proofs the budget assertion against the next Class-A wave.

2. CheckCommand.Run unconditional compilation load (line 158). Confirmed real — after the EC3 gate carve-out refactor, the compilation load happens on every invocation including clean builds, builds where only nullable/XML-doc warnings surfaced, and builds where Tier-2 is gated and no rule covers the codes. 50–500ms wall-time regression on the happy path.

Fix: added SuggesterOrchestrator.AnyDiagnosticIsSuggestable(diags, tier2Enabled, rules) — a flat scan over the (small) diag list against the union of Tier-2's SupportedCodes and every rule's DiagnosticCodes. Microseconds. CheckCommand.Run gates the compilation load behind that pre-check. Five new tests in SuggesterOrchestratorRuleTests cover the truth table:

  • empty diag list → false (clean build skips load)
  • unrelated CS warnings (CS8602/CS8618) → false
  • CS1061 with tier2Enabled=true → true
  • CS1061 with tier2Enabled=false and no rule covering → false (gate-closed Tier-2-only path is non-suggestable)
  • CS1955 covered by a rule with tier2Enabled=false → true (Tier-3 always runs regardless of the gate)

Verification. Reactor.Tests 7184/46 (was 7179, +5 for the new pre-flight facts). CreateTemplateTests integration smoke 2/2. End-to-end against samples/apps/wordpuzzle: clean build (no diagnostics) exits with no output and skips the compilation load; build with GridSize.Pixel(80) + GridSize.Auto() injected fires both rules with full evidence suffixes under the default gate — confirms the pre-check correctly classifies the build as suggestable and nothing regressed on the rule firing path.

@codemonkeychris codemonkeychris merged commit 9e0b012 into main May 12, 2026
7 checks passed
@codemonkeychris codemonkeychris deleted the eval/spec-038-ec3-2026-05-11 branch May 12, 2026 12:36
nmetulev added a commit that referenced this pull request May 18, 2026
Use the modern Windows TitleBar (drag region, system menu, themed caption) as the top-of-window element and wrap content in a Border with 24px padding. Apply the same polish to the `mur --create` scaffolder so both entry points produce a presentable starter app. Align the scaffolder's emitted usings with the dotnet new template (PR #250) so generated apps have the common WinUI/Reactor namespaces ready to go.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>